Porting Smalltalk to the Amstrad CPC: some progress

Posted by pulkomandy on Tue Mar 19 21:15:18 2024 • Comments (0) •

If it doesn't work, keep trying

My previous article on porting Little Smalltalk to the Amstrad CPC received some attention and support messages from other people, who either find it useful, or just interesting to read. I got a comment from someone who was happy that I am doing retrocomputing and not just retrogaming. I'll take that as a compliment and I think it is a somewhat common theme in the Amstrad CPC scene.

As you can probably guess from the previous post on garbage collectors, I am still thinking about this project and trying to figure out a way to make things work.

First, I noticed that I had made a mistake in my first attempt: by not looking closely enough, I assumed that the image file would be somewhat specific to the word size and endianness of the target machine, leading me into attempting to build the image directly on the CPC. A closer look at the code made me realize that this is not the case. The stored image format is generic and does not change depending on endianness and word size.

So, that means I can throw away my work on porting the image builder, and focus on the VM itself. But I was still curious why the built image was more than 100 kilobytes. Surely there is not *that* much code in Little Smalltalk?

I started taking a closer look at the image file in an hex editor. And I noticed something that I should have found much earlier, especially as it was hinted at in the Little Smalltalk book. The image contains not only the bytecode, but also the sourcecode to all methods. This allows Little Smalltalk to not include a disassembler. You can open the sourcecode in an editor, edit it as you want, and then recompile it to create a new bytecode. But the price to pay is a lot of used up space in the image. I think later I will write (or port) a bytecode disassembler (I think there is one in one of the later versions/forks of Little Smalltalk?). For now, I disabled the sourcecode generation in the ImageBuilder, and it resulted in a much smaller image of 49KB. Now that's a size I have at least some chance of fitting into the CPC RAM! So let's try that.

The porting of the Smalltalk VM to the CPC is pretty similar to what I did for the ImageBuilder. I compile the C code, set up a linker script, replace file IO with AMSDOS routines which are much simpler than POSIX, but should still be enough for our needs at least to load the initial image. I encounter similar problems in the image loading than in the image saving (too much stack usage by recursive functions) and apply the same workaround (moving the stack to a different place before calling the heavily recursive functions).

Anyway, I got it running. Sort of. And then I ran out of memory again.

Memory layout of the CPC

To understand the problem (and the next steps), it is useful to understand how memory is addressed on the CPC. First, remember that the machine uses a Z80 CPU, and so, it can address 64KiB of memory using 16-bit addresses. On the CPC this is organized as such (rounded to the next 256th byte for simplicity):

0000-0100: Reset vectors and "low jumpblock" from the firmware. This is the handlers for the RST instruction, interrupts, etc.
0100-A700: Available for applications.
A700-BFFF: Used by the OS. Variables from the disk ROM, indirections for various system routines that are actually implemented in ROM, etc.
C000-FFFF: video RAM

So we have about 42KiB of RAM free for our application. Initially I tried to fit both the interpreter code and the Smalltalk heap (and any variables needed for the interpreter) into this space, and it didn't work. I removed various parts of the interpreter, I replaced the garbage collector with a stupid one that can't free any memory (this particular idea being brought up by one of the papers on garbage collectors I read, apparently some people were using their LISP machines this way for performance reasons and it was used also as a benchmark of the fastest possible GC algorithm).

Anyway, even with just half an interpreter, I still was out of memory rather quickly. So it's time to find other approaches.

Mocing the code to a ROM

The Amstrad CPC OS is well designed to allow running code from expansion ROMs. These ROMs are mapped at C000-FFFF (overlapping the screen memory), and can easily call into the system ROM (which maps at a different address) and also call into other ROMs as neede (for example, the AMSDOS ROM for disk access).

To put our code in ROM, we need some changes to our linker script and CRT0. I based this on scripts I wrote earlier while making a ROM version of the Contiki operating system.

-mwx
-i allinone.ihx
-b _INITIALIZED=0x0100
-b _CODE = 0xC000
-k /bin/../data/sdcc/lib/z80
-k /packages/sdcc-4.4.0-1/.self/data/sdcc/lib/z80
-l z80
crt0_rom.rel
allinone.rel

-e

The main change here is that _CODE is now at 0xC000, and we also add that _INITIALIZED is at 0x0100. _INITIALIZED is where the RAM area starts, the other RAM sections are put right after it. So now the linker will put our code in ROM, and our data in RAM.

Of course we also need to adjust the CRT0 file.

;; FILE: crt0.s
;; Generic crt0.s for runing code as a CPC ROM.

  .module crt0
        .globl  _main

        .globl _progend

        ;; Ordering of segments for the linker.
        ; Things that go in RAM
        .area _INITIALIZED

        .area _DATA
        .area _BSS
    .area _HEAP
_progend::

        ;; Things that go in ROM
        .area _HOME
        .area   _CODE (REL,CON)
        .area _INITIALIZER (REL,CON)
        .area   _GSINIT (REL,CON)
    .area   _GSFINAL (REL,CON)

        .area   _CODE (REL,CON)

	    ;; The _CODE area starts with the ROM header.
		;; This has some version markers, a ROM type (background here, meaning
		;; it will not immediately take control of the system during boot),
		;; And a table of "RSXs", that is, commands that can be called from
		;; BASIC or ASM code by name.
        .db #1  ; Background ROM
        .db #4  ; Version 4.0
        .db #0
        .db #0
        .dw rsx_table

		;; The first routine is called during boot, in our case we do nothing
		;; at that point. The second is invoked by the name defined in the
		;; table below (|ST in our case).
        jp do_nothing
        jp init

rsx_table:

        .db 'T' + #0x80
        .ascii "S"
        .db 'T' + #0x80
        .db #0

init:
; This is the program entry point. Execution starts here
;; Initialise global variables, clear BSS areas
;; Note: not used in Smalltalk, I removed all initialized variables.
;; Probably still a good idea to clear the BSS part so all global variables are
;; initially set to 0...
;       ld      bc, #l__INITIALIZER
;       ld      a, b
;       or      a, c
;       jr      Z, gsinit_next
;       ld      de, #s__INITIALIZED
;       ld      hl, #s__INITIALIZER
;       ldir
;gsinit_next:
; Clear BSS sections
;       ld hl,#s__DATA
;       ld (hl),#0
;       ld de,#s__DATA + #1
;       ld bc,#s__DATA + #0x200
;       ldir

; Initialize disk ROM. We do the simplest thing here and just initialize ROM 7.
; It wouls be more compatible to save the currently active drive and see if
; there are other ROMs to initialize.
        ld hl,#0xabff
        ld de,#0x40
        ld c,#7
        call 0xbcce

        ; Init the heap. Do it after the disk ROM is ready, because that may change HIMEM
		; Not needed: no malloc/free is used.
        ; call __sdcc_heap_init

; Enter the C main.
        call    _main
        ; TODO figure out how to return to BASIC cleanly (we probably erased a bit much)
endless:
_exit::
        jr endless

do_nothing:
        scf
        ret

That's all we need, we can now run our C code from ROM. If it fits there, that is.

Making the code a bit smaller

At this point I had about 20-24K of code. A ROM is 16K, and, for now, I don't want the headache of switching between different ROMs for different parts of the code (maybe later, if I can't fit it all in one single ROM). So, I commented out more code from the interpreter and garbage collector and started simplifying some things. There is no difference anymore between static and dynamic allocations. All integer math code is removed as well as most other things. Also a big way to save space was replacing printf with simpler custom functions that don't do much custom formatting. I have way to print 16 and 8 bit values, NULL terminated strings, and "Pascal" style strings with a length known in advance. That is good enough, and it saved several kilobytes to not need printf's % parsing, decimal integer formatting, etc.

After this, I could finally try to run my VM again. And... it ran out of memory.

I guess that makes sense. Even after moving the VM code out of the way in a ROM, I still get only 42K of RAM free. And that has to hold not only the Smalltalk image data, but also all variables needed by the interpreter, the AMSDOS file buffers, etc. Well, it doesn't fit.

But, if you know a bit about the CPC, you have noticed the existence of the CPC 6128, a machine with twice as much RAM as what I have explained so far. How does this work? Well, the CPU cannot access any more RAM directly. So, there is a special hardware register to map extra RAM pages. There are various ways to use this, they are somewhat handcrafted to run CP/M Plus, a modified version of CP/M that was shipped with the machine and put the user code all in expansion memory, allowing applications up to 63K of RAM usage, with the video memory, and all system internals remaning in the main 64K of RAM.

Anyway, I decided to do something simple, and the simplest way to handle this memory in a way that's friendly to the CPC firmware is to map it by blocks of 16K at addresses 0x4000-0x7FFF. In my previous article about garbage collectors, I considered some more advanced options, but in the end I decided to keep things as simple as possible: all the Smalltalk heap and objects will live in this 4000-7FFF space, and I can keep the remaining parts of RAM (0100-3FFF and 8000-A7000) for the internal needs of the interpreter and GC (which will both need some variables and structures).

Of course, this create a new difficulty: the C compiler isn't designed to work with such memory banks. So, I will have to explicitly map the right memory page before accessing these variables. There are again multiple ways to handle this (so many choices...). For example, I could use a 16-bit address referring to a position in the 64K of extended RAM. But that would prevent having anything more than 64K of extended RAM. There are memory expansions for the CPC, and if just loading the base image already requires nearly 50K, I should rather prepare for use of one of these.

So, I will instead use indirect pointers. The idea is that pointers to objects can be represented as such:

struct indirection {
        unsigned char bank;
        struct object* ptr;
};

The "bank" here is the value to send to the Gate Array (the Amstrad CPC memory management chip) to map the correct RRAM page. For now we have to know about the vlaues C4, C5, C6 and C7 (these allow access to the 64K of extended memory) as well as C8 (this will put the 4000-7FFF from the base RAM back where it was).

The code to dereference an indirect pointer is like so:

__sfr __banked __at(0x7F00) GateArray;
// __sfr means this is accessed with the IN and OUT instructions of the Z80, instead of RAM access instructions
// __banked means the address is still 16bit (normally for IN and OUT it should be 8 bits, but manufacturers of hardware using the Z80 didn't respect this)
// __at(0x7f00) forces the variable to be at that address
// This means we can now write:
// GateArray = xx;
// And this will generate the appropriate assembler code:
// LD BC,0x7f00
// LD A,xx
// OUT (C),A

struct object* DEREF(struct indirection* x)
{
	    // Get the pointer first, in case the indirection is itself stored in
	    // the currently active bank (I didn't use that in the end, so maybe I
	    // can simplify this)
        struct object* o = x->ptr;
        GateArray = x->bank;
        return o;
}

I then modified the Smalltalk object structure to contain "indirections" instead of pointers to other objects. And I also added a lot of calls to DEREF all over the code of the interpreter, to map the correct memory page before accessing anything. As a first step, I didn't try to be smart about it (well, of course I did, then I had a lot of bugs, and I stopped trying to be smart, and then it worked better). So, I made a rule to not have any direct object* pointer stored anywhere in the interpreter code, only indirection* and calls to DEREF over and over. This will not be fast, as every other instruction will be switching to a different memory bank (or uselessly activating a bank that is already activated). But let's try to make it work this way first.

Starting from here, I have two choices. I can store indirections directly inside objects, or separately. Basically, one of these two:

// Indirection stored directly into the object
struct object {
        uint size;
        struct indirection class;
        struct indirection data[0];
};

// Indirection stored separately
struct object {
        uint size;
        struct indirection *class;
        struct indirection *data[0];
};

As you probably guessed by the fact that I named this structure "indirection", I went with the second option. The reason for this is that it is much easier to code when object sizes are in multiple of 2 bytes, and not in "multiples of 3 + 2". And also because you would have to be very careful to store the destination pointer before connecting the new bank, if the indirection itself was stored in a bank.

So, now, I have a big table of indirections living somewhere in 0100-3FFF. This table is filled with pointers to objects in the banked RAM. It creates a limitation to about 5000 objects in the system (16K/3). I have not hit this limit yet, but the initial image is already about half of that. We'll see how that goes.

This indirection also may simplify work for the garbage collector: it allows to easily move objects from one address to another, since all references to an object will be through the indirection table: just update the indirection table entry, and you're done. I think the garbage collector will benefit from that simplicity of not having to crawl through all other objects to find references to the one being moved.

Another advantage, mainly for debugging, is that there are now a lot of constraints on what a valid pointer is. An indirection must have an address in the 0100-3fff range. The pointed object must have a bank in the C4-C8 range, and a pointer in the 4000-7FFF value range. I have added some checks to my DEREF function to validate this. If the code attempts to dereference an invalid reference, execution stops with an error message.

Finally, these contraints also mean we have several unused bits, both in the pointers and in the indirections themselves. These are useful to store extra information. In indirection pointers, the high bit being set means that the value is not an indirection, but a direct 15-bit integer value. In indirections, unused bits in the bank or in the pointer fields can be used by the garbage collector to track which objects have already been garbage collected (the original code also did similar things, but using the size field in the objects themselves for the gc, and the lower bits of the pointers to mark integers, relying on the fact that allocations are aligned to 2 bytes and so for pointers that bit is always 0).

So, after all these changes and a bit of debugging, we're getting somewhere. The full image can be loaded into banks, and then execution starts. Unfortunately, it seems to be confused after a short time, returning to the wrong place after executing a "block2 (blocks are similar to lambdas in other programming languages).

So, I have more debugging to do. I have to keep a balance between adding more debug statements in the code, and not overrunning my budget of 16K of code in a single ROM. For now, I do not want to leave C and start rewriting things in assembler, because it would make it harder to change things and try new ideas. I hope I can delay that until the code is mostly running and there are at least some parts I'm sure won't change anymore. Then I can rewrite or hand optimize these first, and save enough space for debugging the other parts. Also, working in the ACE emulator has been very helpful, setting breakpoints in strategic places allowed me to find and fix many of the problems.

I continued debugging, identifying and fixing various issues. For example, if I use the DEREF function twice in the same line of code, like so:

DEREF(x) = DEREF(y);

that doesn't work right, because the = operator is not a sequence point. That means the bank activations for each side will not happen in the right order for things to be read and written where they should. So this must always be done through a temporary variable, which ensures proper sequencing.

Other mistakes include too many or not enough calls to DEREF, or simply writing code that doesn't behave the same way as the original while adding the DEREF calls. Eventually, I got it to run! Well, sort of.

Running out of space (again)

One of the very first things the Smalltalk image does is creating the 256 "Char" objects for each of the ASCII characters. Each of them is a full-blown object, using up 6 bytes in the heap. This saves a lot of special-casing in the VM. But, the loop to create them ends up creating a lot of objects and filling up my indirection array. Eventually the indirection array grows into the 4000-FFFF space, overwrites some objects there, and things crash badly in various ways. So maybe this indirection was not such a great idea. Or maybe I need to start thinking about the garbage collection, and see how many entries in that array it would manage to free. But the problem with that is, at the moment, I have less than 128 bytes left in my ROM, so a GC is not going to fit. Unless I start removing debugging traces and optimizing some code in assembler. Which I don't want to do until I'm sure the indirection table approach is working.

Well, that's all for today. See you in the next article in the series...

PulkoMandy's homepage